When we (the designers) visualize data, we encode the quantitative information in shapes, color, position, etc. The viewers then have to decode that information. Cleveland and McGill studied what people are able to decode most accurately and ranked them in the following list.
1. Position along a common scale e.g. scatter plot
2. Position on identical but nonaligned scales e.g. multiple scatter plots
3. Length e.g. bar chart
4. Angle & Slope (tie) e.g. pie chart
5. Area e.g. bubbles
6. Volume, density, and color saturation (tie) e.g. heatmap
7. Color hue e.g. newsmap (quantitative data)
See ppt#7 Perception.pdf
Building blocks of a graph include:
1. data
2. aesthetic mapping aes()
- position (x-axis, y-axis)
- color
- fill
- shape (of points)
- linetype
- size
3. geometric object
- histogram: geom_histogram
- points: geom_point for scatter plots, dot plots, etc
- lines: geom_line for time series, trend lines, etc
- boxplot: geom_boxplot
- text: geom_text
4. statistical transformations
5. scales: one scale per mapping
6. coordinate system
7. position adjustments
8. faceting
Each layer combines data, aesthetic mappings, a geom, a stat, and position adjustments; every aesthetic mapping gets its own scale.
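As a sketch, the building blocks above can be assembled in one plot call (using the built-in mtcars data):

```r
library(ggplot2)

ggplot(mtcars,                                       # data
       aes(x = wt, y = mpg, color = factor(cyl))) +  # aesthetic mapping
  geom_point() +                                     # geometric object
  geom_smooth(method = "lm", se = FALSE) +           # statistical transformation
  scale_color_brewer(palette = "Dark2") +            # one scale per mapping
  coord_cartesian(xlim = c(1, 6)) +                  # coordinate system
  facet_wrap(~am)                                    # faceting
```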
aes() is not needed for constant values.
Displaying data distributions of continuous variables
Aim: to display the important features of the data.
histograms and boxplots
violin plots (a combination of boxplots and density plots), ridgeline plots…
Features of continuous variables:
If the data are highly skewed, consider transforming them (e.g., a Box-Cox transformation)
Modeling and testing for continuous variables
bins = 30 is the geom_histogram() default. Display histograms with equal scales and binwidths to better compare features of variables.
What density estimates show depends greatly on the bandwidth used
\[\text{Density}=\frac{\text{RelFreq}}{\text{Binwidth}}\]
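As a quick check of this formula, base R's hist() reports densities that equal relative frequency divided by binwidth (toy data, equal-width bins assumed):

```r
x <- c(1.2, 1.8, 2.4, 2.9, 3.5, 3.7, 4.1, 4.8)    # toy sample
binwidth <- 1
h <- hist(x, breaks = seq(1, 5, by = binwidth), plot = FALSE)
rel_freq <- h$counts / length(x)                   # relative frequencies
all.equal(h$density, rel_freq / binwidth)          # TRUE
```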
If the histogram is too jagged (goes up and down), increase the binwidth, i.e., decrease the number of bins
center: the bin positions can also be shifted, e.g., via the center (or boundary) argument of geom_histogram(), which changes a histogram's appearance
The boxplot is a compact distributional summary, displaying less detail than a histogram or kernel density, but also taking up less space. Boxplots use robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. They are particularly useful for comparing distributions across groups. - Hadley Wickham
NOT FOR CATEGORICAL VARIABLES (not for discrete data either!)
Boxplot is best to compare distributions by subgroups
Boxplots by group: must share the same scale, and can be drawn with width a function of group size
Boxplots of different variables: have different scales, and each case appears in every boxplot (no need to consider different widths)
Disadvantage: cannot reveal whether the distribution is multimodal
Outliers: over 1.5 times the box length away from the box
Extreme outliers: over 3 times the box length away from the box
Boxplots can be reordered by a statistic (median, SD, mean) with reorder()
ggplot() + geom_boxplot() needs aes(y = var) or aes(x = "something", y = var). If y is not specified, there will be an error (the variable is taken as the x aesthetic)…
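A minimal sketch, using the built-in iris data, of grouped boxplots reordered by the group median, with widths reflecting group sizes:

```r
library(ggplot2)

ggplot(iris, aes(x = reorder(Species, Sepal.Length, median),
                 y = Sepal.Length)) +
  geom_boxplot(varwidth = TRUE) +   # box width reflects group size
  labs(x = "Species (ordered by median)")
```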
shapiro.test to test normality.
ggplot() + stat_qq() must be given a data frame (with aes(sample = var))
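A sketch of a normality check combining a QQ plot with shapiro.test() (simulated data):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(100))   # simulated normal sample

ggplot(df, aes(sample = x)) +      # stat_qq() needs a data frame
  stat_qq() +
  stat_qq_line()

shapiro.test(df$x)                 # H0: the sample is from a normal distribution
```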
Modelling and testing for relationships between variables
1. Correlation - correlation coefficients
2. Regression
3. Smoothing
- loess (locally weighted regression)
- spline function
4. Bivariate density estimation
- kde2d (MASS); kde (ks); bkde2D (KernSmooth)
5. Outliers
Features that might be visible in scatterplots:
1. Causal relationships (linear and nonlinear): correlation \(\ne\) causation
2. Association (without being directly causally related)
3. Outliers or groups of outliers
4. Clusters: to assess the possibility of clustering, consider a density estimate.
5. Gaps
6. Barriers (boundaries)
7. Conditional relationship (different relationships for different intervals of x)
i. e.g. a plot of income against age is likely to differ before and after retirement age
add lines or smooths: if you think there is a linear causal relationship
Useful additions for scatterplots:
- facet_wrap(~var) to condition on a third variable
- scatterplot matrices: ggpairs (GGally), splom (lattice), spm (car)
- alpha = 0.5 and position = "jitter" to reduce overplotting
- geom_density_2d() to overlay contours of a bivariate density estimate
Scatterplot matrices are valuable for identifying bivariate patterns even with quite a few variables
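A sketch combining several of these tools on ggplot2's built-in diamonds data (subset assumed only to keep the plot fast):

```r
library(ggplot2)

ggplot(diamonds[1:2000, ], aes(x = carat, y = price)) +
  geom_point(alpha = 0.5, position = "jitter") +  # reduce overplotting
  geom_density_2d() +                             # bivariate density contours
  geom_smooth(method = "loess")                   # add a smooth
```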
Modelling and testing for multivariate continuous data
1. Outliers
- interactive graphics are the best approach
2. Regular gaps in the distribution of a variable
3. Clusters of cases
4. Separated groups: whether this means anything depends on the values those cases take on other variables, especially categorical ones
- linear models can be useful in assessing the features
Parallel coordinate plots are popular for multivariate continuous data
ggparcoord
parallel coordinate plots can include axes for categorical variables as well
When to use:
- to detect a general trend that data follows, and also the specific cases that are outliers
- not an ideal graph to use when only categorical variables are involved
- to identify trends in specific clusters
- highlight each cluster in a different color using the groupColumn argument
- graphing time series data - information stored at regular time intervals
Scales
- std: default value, where it subtracts mean and divides by SD
- robust: subtract median and divide by median absolute deviation
- uniminmax: scale all values so that the minimum is at 0 and maximum at 1
\[y_{ij}=\frac{x_{ij}-\min_i x_{ij}}{\max_i x_{ij}-\min_i x_{ij}}\]
- globalminmax: no scaling; the original values are used
- center: centers each variable according to the value given in scaleSummary
- centerObs: centers each variable according to the value of the observation given in centerObsID
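A sketch of ggparcoord() with the scales above, using the built-in iris data:

```r
library(GGally)   # provides ggparcoord()

ggparcoord(iris,
           columns = 1:4,            # the four continuous measurements
           groupColumn = "Species",  # one color per cluster
           scale = "uniminmax")      # every axis rescaled to [0, 1]
```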
splineFactor = 10 in ggparcoord() draws smoothed spline curves instead of straight segments.
Single categorical variables, nominal, ordinal, or discrete.
Graphics: bar charts and pie charts …
Features:
1. Unexpected patterns of results
* many more of some categories than others
* some may be missing completely
2. Uneven distributions
* bias, dominated by some major trials…
3. Extra categories
* M and F, but also m and f, male and female
4. Unbalanced experiments
* missing or unusable
5. Large numbers of categories
6. many others: refusals, errors, missings...
Nominal data - no fixed category order
Ordinal data - fixed category order
goodfit(table(categoryvar)) from the vcd package fits a discrete distribution.
Counted on the fly (geom_bar(x)) vs. binned (geom_col(x, y))
Use the levels() function to inspect or set category order.
Modeling and testing for categorical variables
1. Testing by simulation
I. \(\chi^2\) test
2. Evenness of distribution
I. \(\chi^2\) test with the null hypothesis of equally likely categories, e.g., when a random number generator is to be checked directly
3. Fitting discrete distribution
I. \(\chi^2\) test: inspect any lack of fit
chisq.test()
State the null and alternative hypotheses when reporting the test.
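A minimal sketch of a \(\chi^2\) test of evenness (hypothetical counts):

```r
counts <- c(A = 18, B = 25, C = 17)   # hypothetical category counts
res <- chisq.test(counts)             # H0: equal probabilities for all categories
res$p.value
```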
Use stat = "identity" in geom_bar() when the counts are precomputed. If each row is one observation to be counted into bars, this is not needed.
A great alternative to a simple bar chart:
R built-in base function dotchart(), or geom_point in ggplot
Frequency:
- bar charts
- Cleveland dot plots
Proportion/Association:
- Mosaic plots
- Fluctuation diagrams
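A sketch of a Cleveland dot plot with geom_point (hypothetical counts):

```r
library(ggplot2)

df <- data.frame(category = c("A", "B", "C", "D"),
                 count    = c(12, 30, 7, 21))   # hypothetical data

ggplot(df, aes(x = count, y = reorder(category, count))) +
  geom_point(size = 3) +
  labs(y = NULL)
```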
Modelling and testing for multivariate categorical data
1. Contingency tables
- The standard for checking the association of two categorical variables is the \(\chi^2\) test.
2. Associations between categorical variables
3. Binary dependent variables
Lengths are easier to judge and compare than areas, so it is best to use displays where each rectangle has the same width or the same height.
Mosaic plots split alternately: a horizontal split into rectangles, then a vertical split, then a horizontal split, and so on
Favorite ~ Age + Music: Age is split first, Music next, Favorite last.
- independent variables are split first, then the dependent variable
Default:
- Age: h
- Music: v
- Favorite: h
Changed to ("v","v","h"): Age v, Music v, Favorite h
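The split directions can be sketched with base R's mosaicplot(), whose dir argument takes one "h"/"v" per variable in splitting order (the built-in Titanic data is used here in place of the Age/Music/Favorite example):

```r
mosaicplot(~ Sex + Class + Survived, data = Titanic,
           dir = c("v", "v", "h"),   # first split v, second v, last h
           color = TRUE)
```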
No gap = most efficient
Taller thinner rectangles are better
With a large number of combinations, no mosaicplot is likely to work.
Mosaic plots can be very helpful for displaying raw data and they can also be used to support modelling.
Doubledecker plot: good for comparing rates for a binary dependent variable across all possible groupings
[Slide figures: a labeled mosaic; an example of no relationship (right side); an example of a deterministic relationship]
- geom_bin2d(): rectangular 2-D bins
- geom_tile(): graphs all cells in the data frame and colors them by their value
- geom_hex(): hexagonal bins
These are like a combination of scatterplots and histograms: they let you compare variables while also seeing their relative distributions
can be used for continuous or categorical data (both for axes and fill color)
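A sketch of 2-D binning on a large scatterplot (diamonds ships with ggplot2):

```r
library(ggplot2)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d(bins = 40)   # counts per rectangular bin mapped to fill
```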
fluctile() (from the extracat package) draws fluctuation diagrams
e.g., Likert-scale categories: strongly agree, agree, don't know (neutral), disagree, strongly disagree
Special Features of time series
1. Data definitions
- Dependence on time
- Have a given order, and individual values are not independent of one another
2. Length of time series
- Time series can be short (annual sales) or long (every minute)
- Sometimes the short-term details of long series obscure long-term trends; sometimes they are of particular interest
- Series on different time scales can be informative
- For longer periods, consider taking the final value, the mid-period value, the average, or a weighted average… or add a smooth… or many other options
3. Regular and irregular time series
- regular time series: equally spaced time points. e.g., hourly data, daily data, yearly data….
- irregular time series: e.g., patient’s temperature or blood pressure; political opinion polls are more frequent near elections than at other times
4. Time series of different kinds of variables
- Most are assumed to be continuous variables, but nominal and discrete series also occur
5. Outliers
- not necessarily extreme values for the whole series; they may be unusual relative to the surrounding pattern
- scales for time series are unusual…
- “The usual principle applies that it is best to draw several displays, zooming in to inspect details that may otherwise be hidden and zooming out to see the overall context”
6. Forecasting
- Two main reasons for studying time series:
- to try to understand the patterns of the past
- to try to forecast the future
7. Seeing patterns
parallel coordinate plots display regular time series well and can be readily used for plotting multiple series
Short, irregular time series: smoothing (spline smooths and so on); a confidence interval also helps
df %>% group_by(symbol) %>%
  mutate(rescaled_close = 100 * close / close[1])
This rescales the closing stock price for each symbol (company) to start at 100, grouping by symbol.
Overall long-term trend.
geom_smooth() function
- Loess Smoother
A loess smoother does not assume any model
loess: Locally estimated scatterplot smoothing
- non-parametric regression
- no global function of any form is specified to fit the data; only segments of the data are fitted locally.
- A smooth curve fitted to a set of data points with this technique is called a loess curve.
- Advantages: does not require the specification of a function to fit a model to all of the data in the sample; ideal for modeling complex processes for which no theoretical models exist.
- Disadvantages: less efficient use of data than other least squares methods; requires fairly large, densely sampled data sets to produce good models; does not produce a regression function that is easily represented by a mathematical formula.
The smoothing parameter span controls the fit: the default is span = 0.75; a smaller value (e.g., span = 0.1) makes the curve follow the data more closely and fluctuate more, while larger values give smoother curves.
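A sketch comparing spans on ggplot2's built-in economics series:

```r
library(ggplot2)

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "grey60") +
  geom_smooth(method = "loess", span = 0.1,  se = FALSE) +   # wiggly
  geom_smooth(method = "loess", span = 0.75, se = FALSE,
              color = "red")                                 # smoother (default span)
```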
use facet on season (day of month, day of week etc.)
Monthly plot
cyclic pattern: if the fluctuations are not of fixed period
geom_point() in addition to geom_line()
geom_point() with geom_line() is one way to detect missing values.
Outliers on individual variables can be spotted using boxplots
Bivariate outliers can be spotted using scatterplots
Higher-dimensional outliers may not be outliers in lower dimensions.
In a boxplot, points more than 1.5 IQR (the interquartile range) outside the hinges (the quartiles) are flagged as outliers: \[\begin{aligned} \text{outliers} &< Q_1-1.5\,\text{IQR}\\ \text{outliers} &> Q_3+1.5\,\text{IQR} \end{aligned}\]
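The fences can be computed by hand as a check (toy data; quantile() quartiles assumed, which can differ slightly from Tukey's hinges):

```r
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 30)   # toy data with one clear outlier
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]   # flags 30
```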
Outliers may change if they are grouped by another variable.
Scatterplots and parallel coordinate plots are useful for visualizing multivariate outliers. You could regard points as outliers that are far from the mass of the data, or you could regard points as outliers that do not fit the smooth model well.
adding density estimation and loess smoother can help
parallel coordinate plot to detect outliers
It is more difficult to find categorical outliers than continuous outliers
Fluctuation diagrams can be used to find categorical outliers.
A strategy for dealing with outliers is as follows
1. Plot the one-dimensional distributions of the variables using boxplots. Examine any extreme outliers to see if they are rare values or errors and decide if they should be removed or imputed.
2. For outliers which are extreme on one dimension, examine their values on other dimensions to decide whether they should be discarded or not. Discard values that are outliers on more than one dimension.
3. Consider cases which are outliers in higher dimensions but not in lower dimensions. Decide whether they are errors or not and consider discarding or imputing the errors.
4. Plot boxplots and parallel coordinate plots by using grouping on a variable to find outliers in subsets of the data.
Some statistics are little affected by outliers, e.g., medians
While some are greatly affected, e.g., the mean, scale estimates…
Two alternative extremes: either keeping the outlier or discarding it
In modelling, robust methods attempt to reduce the effect of outliers by calculating a weighting for each case
Individual extreme values are easy to spot, groups of outliers are more difficult to determine (can be caused by many different reasons)
Graphical displays are useful for finding univariate outliers and bivariate ones
Sploms and parallel coordinate plots can be helpful for studying potential outliers
Of the tables below, only table1 is tidy data
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rows
table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b # population
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
To identify whether data are tidy, consider whether columns can be gathered into a single variable: e.g., "male births in Denmark" and "male births in the Netherlands" can be gathered into one male-births variable plus a country variable; see sample question #18
gather(): key-value pair
- key column: the name of the new variable defined by the values of the old column headings
- value column: the name of the column that holds the values
rename() renames columns
t(): converts to a matrix; converts numeric values to character if there are any non-numeric columns; works best if the data frame has row names, which become column names.
gather() and spread() can also transform multiple columns.
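A sketch of gather() on table4a (the tidyr built-in shown above): the old column headings 1999/2000 become values of the key column year, and the counts move into the value column cases:

```r
library(tidyr)

table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```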
mutate_all(), mutate_if(), mutate_at(): apply transformations to all columns, to columns matching a predicate, or to named columns
A count y-axis should show integers, not decimals
pull from GitHub master to Local master and push from Local master to GitHub master
branching
Workflow
1. Creating repo on GitHub (original/master)
2. Clone it once to local master (Make a local copy, next time this step will be a PULL)
3. Create a branch to do your new work (local branch1)
4. Commit and Push the new branch to GitHub to be reviewed (to GitHub origin/branch1)
5. Submit a pull request, then someone else merges your changes into master (GitHub origin/master)
- merging a pull request: the original author or someone else can do it; what matters is that you communicate with your collaborators and decide how to manage the pull request
6. Your branch is deleted and the new stuff is pulled into your copy of the master branch
- delete the branch locally: git branch -d <branchname>, then git fetch -p (stop tracking the remote branch)
Local master is the last to know!!
Types of repositories(from your perspective)
- local repository: resides on your computer
- remote repository: resides somewhere else
- origin: the repo that you created or forked on GitHub
- upstream: the original repo of the project that you forked (if you didn’t create it)
The workflow for your \(1^{st}\) PR
1. fork repo (once)
2. clone repo (once)
3. configure a remote that points to the upstream repository (once)
>git remote add upstream https://...
4. branch
5. work, commit/push
6. submit pull request
7. wait for PR to be merged
>git fetch upstream
>git checkout master
>git merge upstream/master
>git branch -d <branchname>
>git fetch -p